[Figure 2.3: two rows of query-value histograms for Block.0.query, Block.3.query, and Block.6.query; panel (a) Full-Precision, panel (b) Q-ViT.]

FIGURE 2.3
The histogram of query values q (shadow) along with the PDF curve of the Gaussian distribution N(μ, σ²) [195], for three selected layers in DeiT-T and the 4-bit fully quantized DeiT-T (baseline). μ and σ² are the statistical mean and variance of the values.
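
To make the analysis behind Figure 2.3 concrete, the following is a minimal sketch of how such per-block query statistics could be collected. It assumes a timm-style DeiT/ViT whose blocks expose an attn.qkv linear layer; the collect_query_stats helper and the hook-based capture are illustrative assumptions, not the authors' actual tooling.

import torch

@torch.no_grad()
def collect_query_stats(model, images, block_ids=(0, 3, 6)):
    """Capture query activations of selected blocks and report mean/variance.

    Assumes a timm-style DeiT/ViT in which block i exposes blocks[i].attn.qkv,
    a Linear producing concatenated q, k, v; adapt the hook target otherwise.
    """
    captured = {}

    def make_hook(idx):
        def hook(module, inputs, output):
            dim = output.shape[-1] // 3          # qkv output is (B, N, 3 * dim)
            captured[idx] = output[..., :dim].detach().float()  # keep only q
        return hook

    handles = [model.blocks[i].attn.qkv.register_forward_hook(make_hook(i))
               for i in block_ids]
    model.eval()
    model(images)
    for h in handles:
        h.remove()

    # Statistical mean and variance, as reported under Figure 2.3.
    return {i: (q.mean().item(), q.var().item()) for i, q in captured.items()}

Running this on the full-precision DeiT-T and on its 4-bit quantized counterpart with the same batch gives the μ and σ² pairs that the figure compares.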

For ease of training, the input to the matrix multiplication layers is set to v̂, which is mathematically equivalent to the inference operations described earlier. The input activations and

weights are set to 2, 3, 4, or 8 bits for all matrix multiplication layers except the first and

last, which are always set to 8 bits. This standard practice in quantized networks has been

shown to improve performance significantly. All other parameters are represented using

FP32. Quantized networks are initialized using weights from a trained full-precision model

with a similar architecture before being fine-tuned in the quantized space.
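
As a rough illustration of this setup, the sketch below shows a fake-quantized linear layer with a straight-through estimator, an 8-bit assignment for the first and last layers, and initialization from trained full-precision weights. The FakeQuantLinear module, the symmetric max-scaled quantizer, and the build_quantized_model helper are illustrative assumptions rather than the exact baseline implementation.

import copy
import torch
import torch.nn as nn

def fake_quantize(x, num_bits):
    # Symmetric uniform quantization with a straight-through estimator (STE).
    qmax = 2.0 ** (num_bits - 1) - 1
    scale = x.detach().abs().max().clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale
    return x + (q - x).detach()   # forward: quantized value; backward: identity

class FakeQuantLinear(nn.Linear):
    # Linear layer whose weights and input activations are fake-quantized.
    def __init__(self, in_features, out_features, bias=True, num_bits=4):
        super().__init__(in_features, out_features, bias)
        self.num_bits = num_bits

    def forward(self, x):
        w_q = fake_quantize(self.weight, self.num_bits)
        x_q = fake_quantize(x, self.num_bits)
        return nn.functional.linear(x_q, w_q, self.bias)

def build_quantized_model(fp_model, num_bits=4):
    # Swap every Linear for a fake-quantized one; keep the first and last at
    # 8 bits and copy the trained full-precision weights as initialization.
    model = copy.deepcopy(fp_model)
    names = [n for n, m in model.named_modules() if isinstance(m, nn.Linear)]
    for name in names:
        bits = 8 if name in (names[0], names[-1]) else num_bits
        parent = model.get_submodule(name.rsplit(".", 1)[0]) if "." in name else model
        child = name.rsplit(".", 1)[-1]
        old = getattr(parent, child)
        new = FakeQuantLinear(old.in_features, old.out_features,
                              old.bias is not None, num_bits=bits)
        new.load_state_dict(old.state_dict())
        setattr(parent, child, new)
    return model

In an actual ViT the very first layer is the convolutional patch embedding rather than a Linear; the sketch simply treats the first and last Linear modules as stand-ins for the always-8-bit layers, and all other parameters stay in FP32 as described above.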

2.3 Q-ViT: Accurate and Fully Quantized Low-Bit Vision Transformer

Inspired by their success in natural language processing (NLP), transformer-based mod-

els have shown great power in various computer vision (CV) tasks, such as image clas-

sification [60] and object detection [31]. Pre-trained with large-scale data, these mod-

els usually have a large number of parameters. For example, the ViT-H model has 632M parameters, consuming 2528 MB of memory and requiring 162G FLOPs, which is expensive in both memory and computation during inference. This limits the deployment of these models on

resource-limited platforms. Therefore, compressed transformers are urgently needed for real

applications.
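
As a quick sanity check on these figures, the memory number follows directly from the parameter count if the weights are stored in FP32, i.e., 4 bytes per parameter (an assumption for illustration):

\[
632 \times 10^{6}\ \text{parameters} \times 4\ \tfrac{\text{bytes}}{\text{parameter}} = 2.528 \times 10^{9}\ \text{bytes} \approx 2528\ \text{MB}.
\]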

Quantization-aware training (QAT) [158] methods perform quantization during backpropagation and generally achieve a much smaller performance drop at a higher compression rate. QAT has proven effective for CNN models [159] on CV tasks. However, QAT methods remain largely unexplored for low-bit quantization of vision transformers. Therefore, we first build a fully quantized ViT baseline, a straightforward yet effective solution based on standard techniques. Our study discovers that the performance drop of the fully quantized ViT stems from information distortion in the attention mechanism during the forward process and from ineffective optimization for eliminating the distribution difference through distillation during backward propagation. First, the ViT attention mechanism aims to model long-distance

dependencies [227, 60]. However, our analysis shows that a direct quantization method

leads to information distortion and a significant distribution variation for the query mod-

ule between the quantized ViT and its full-precision counterpart. For example, as shown

in Fig. 2.3, the variance difference is 0.4409 (1.2124 vs. 1.6533) for the first block.¹ This

¹ This supports the Gaussian distribution hypothesis [195].